Lesson 2: Working with tabular data

Published

October 26, 2023

1 Prepare the R environment for this lesson

For this lesson we’ll need four packages:

  • palmerpenguins
  • here
  • readr
  • dplyr

First, we need to install the here package, using the install.packages() function.

install.packages("here")

Note, the readr and dplyr packages are part of the tidyverse, so we don’t need to install them separately. Now we use the library() function to load all of these packages.

library(here)
here() starts at C:/Users/nickopotamus/Projects/ITMAT_office_houRs
library(palmerpenguins)
library(readr)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

The output from this code indicates R successfully loaded these packages. The message we get from the here package tells us which directory on our computers it’s using as the “root” directory (more on what this means below). The other warning messages explains that dplyr includes several functions (e.g. filter() and lag()) that have the same names as functions from the stats and base packages. R’s default behavior for resolving these naming conflicts is to use the version of the function loaded most recently into memory. The warning lets us know that if we run any of these functions, R will use the version of these functions from the dplyr package and not the stats/base package.

The potential for naming conflicts like this is another reason why we only load the packages we need for the current analyses.

RStudio’s Tab-Complete List Includes Package Names for Each Function

When we start to type a function’s name, and RStudio’s tab-completion prompt opens up, the package for each function is listed in curly bracers to the right of the function’s name. As we’re writing code, we can use this make sure we’re using the correct version of the function.

2 Reading data from files

We have a file named penguins.csv in the DATA/ directory. This data file is in .csv (comma-separated values) format and contains the same information as the penguins data frame we used in the previous lesson. We can view the raw contents of this file by using the Files tab in the lower right pane of the RStudio window. Navigate to the DATA/ directory, click on the penguins.csv file, then select the “View File” option.

2.1 Base R

We can read the data from this file using the base R function read.csv() and store the data in a variable named penguin_data_from_base_r.

penguin_data_from_base_r <- read.csv(file = here::here("DATA/penguins.csv"))

In R, <- is called the assignment operator. The assignment operator takes the output from the function to its right (read.csv()) and assigns it to the variable to its right (penguin_data_from_base_r). Put differently, we’re storing the contents of the ‘penguins.csv’ file in ‘penguin_data_from_base_r’.

The here package

The here() function is helping to point R to the specific location of the file. It allows us to define a file’s location relative to the project’s main directory (ITMAT_office_houRs, in this case). While the here() function is not required to read files, it can make our lives easier when we’re using RStudio’s ‘Projects’ to manage our work.

We can look at the contents of penguin_data_from_base_r by entering the variable name into the R console.

penguin_data_from_base_r
      species    island bill_length_mm bill_depth_mm flipper_length_mm
1      Adelie Torgersen           39.1          18.7               181
2      Adelie Torgersen           39.5          17.4               186
3      Adelie Torgersen           40.3          18.0               195
4      Adelie Torgersen             NA            NA                NA
5      Adelie Torgersen           36.7          19.3               193
6      Adelie Torgersen           39.3          20.6               190
7      Adelie Torgersen           38.9          17.8               181
8      Adelie Torgersen           39.2          19.6               195
9      Adelie Torgersen           34.1          18.1               193
10     Adelie Torgersen           42.0          20.2               190
11     Adelie Torgersen           37.8          17.1               186
12     Adelie Torgersen           37.8          17.3               180
13     Adelie Torgersen           41.1          17.6               182
14     Adelie Torgersen           38.6          21.2               191
15     Adelie Torgersen           34.6          21.1               198
16     Adelie Torgersen           36.6          17.8               185
17     Adelie Torgersen           38.7          19.0               195
18     Adelie Torgersen           42.5          20.7               197
19     Adelie Torgersen           34.4          18.4               184
20     Adelie Torgersen           46.0          21.5               194
21     Adelie    Biscoe           37.8          18.3               174
22     Adelie    Biscoe           37.7          18.7               180
23     Adelie    Biscoe           35.9          19.2               189
24     Adelie    Biscoe           38.2          18.1               185
25     Adelie    Biscoe           38.8          17.2               180
26     Adelie    Biscoe           35.3          18.9               187
27     Adelie    Biscoe           40.6          18.6               183
28     Adelie    Biscoe           40.5          17.9               187
29     Adelie    Biscoe           37.9          18.6               172
30     Adelie    Biscoe           40.5          18.9               180
31     Adelie     Dream           39.5          16.7               178
32     Adelie     Dream           37.2          18.1               178
33     Adelie     Dream           39.5          17.8               188
34     Adelie     Dream           40.9          18.9               184
35     Adelie     Dream           36.4          17.0               195
36     Adelie     Dream           39.2          21.1               196
37     Adelie     Dream           38.8          20.0               190
38     Adelie     Dream           42.2          18.5               180
39     Adelie     Dream           37.6          19.3               181
40     Adelie     Dream           39.8          19.1               184
41     Adelie     Dream           36.5          18.0               182
42     Adelie     Dream           40.8          18.4               195
43     Adelie     Dream           36.0          18.5               186
44     Adelie     Dream           44.1          19.7               196
45     Adelie     Dream           37.0          16.9               185
46     Adelie     Dream           39.6          18.8               190
47     Adelie     Dream           41.1          19.0               182
48     Adelie     Dream           37.5          18.9               179
49     Adelie     Dream           36.0          17.9               190
50     Adelie     Dream           42.3          21.2               191
51     Adelie    Biscoe           39.6          17.7               186
52     Adelie    Biscoe           40.1          18.9               188
53     Adelie    Biscoe           35.0          17.9               190
54     Adelie    Biscoe           42.0          19.5               200
55     Adelie    Biscoe           34.5          18.1               187
56     Adelie    Biscoe           41.4          18.6               191
57     Adelie    Biscoe           39.0          17.5               186
58     Adelie    Biscoe           40.6          18.8               193
59     Adelie    Biscoe           36.5          16.6               181
60     Adelie    Biscoe           37.6          19.1               194
61     Adelie    Biscoe           35.7          16.9               185
62     Adelie    Biscoe           41.3          21.1               195
63     Adelie    Biscoe           37.6          17.0               185
64     Adelie    Biscoe           41.1          18.2               192
65     Adelie    Biscoe           36.4          17.1               184
66     Adelie    Biscoe           41.6          18.0               192
67     Adelie    Biscoe           35.5          16.2               195
68     Adelie    Biscoe           41.1          19.1               188
69     Adelie Torgersen           35.9          16.6               190
70     Adelie Torgersen           41.8          19.4               198
71     Adelie Torgersen           33.5          19.0               190
72     Adelie Torgersen           39.7          18.4               190
73     Adelie Torgersen           39.6          17.2               196
74     Adelie Torgersen           45.8          18.9               197
75     Adelie Torgersen           35.5          17.5               190
76     Adelie Torgersen           42.8          18.5               195
77     Adelie Torgersen           40.9          16.8               191
78     Adelie Torgersen           37.2          19.4               184
79     Adelie Torgersen           36.2          16.1               187
80     Adelie Torgersen           42.1          19.1               195
81     Adelie Torgersen           34.6          17.2               189
82     Adelie Torgersen           42.9          17.6               196
83     Adelie Torgersen           36.7          18.8               187
84     Adelie Torgersen           35.1          19.4               193
85     Adelie     Dream           37.3          17.8               191
86     Adelie     Dream           41.3          20.3               194
87     Adelie     Dream           36.3          19.5               190
88     Adelie     Dream           36.9          18.6               189
89     Adelie     Dream           38.3          19.2               189
90     Adelie     Dream           38.9          18.8               190
91     Adelie     Dream           35.7          18.0               202
92     Adelie     Dream           41.1          18.1               205
93     Adelie     Dream           34.0          17.1               185
94     Adelie     Dream           39.6          18.1               186
95     Adelie     Dream           36.2          17.3               187
96     Adelie     Dream           40.8          18.9               208
97     Adelie     Dream           38.1          18.6               190
98     Adelie     Dream           40.3          18.5               196
99     Adelie     Dream           33.1          16.1               178
100    Adelie     Dream           43.2          18.5               192
101    Adelie    Biscoe           35.0          17.9               192
102    Adelie    Biscoe           41.0          20.0               203
103    Adelie    Biscoe           37.7          16.0               183
104    Adelie    Biscoe           37.8          20.0               190
105    Adelie    Biscoe           37.9          18.6               193
106    Adelie    Biscoe           39.7          18.9               184
107    Adelie    Biscoe           38.6          17.2               199
108    Adelie    Biscoe           38.2          20.0               190
109    Adelie    Biscoe           38.1          17.0               181
110    Adelie    Biscoe           43.2          19.0               197
111    Adelie    Biscoe           38.1          16.5               198
112    Adelie    Biscoe           45.6          20.3               191
113    Adelie    Biscoe           39.7          17.7               193
114    Adelie    Biscoe           42.2          19.5               197
115    Adelie    Biscoe           39.6          20.7               191
116    Adelie    Biscoe           42.7          18.3               196
117    Adelie Torgersen           38.6          17.0               188
118    Adelie Torgersen           37.3          20.5               199
119    Adelie Torgersen           35.7          17.0               189
120    Adelie Torgersen           41.1          18.6               189
121    Adelie Torgersen           36.2          17.2               187
122    Adelie Torgersen           37.7          19.8               198
123    Adelie Torgersen           40.2          17.0               176
124    Adelie Torgersen           41.4          18.5               202
125    Adelie Torgersen           35.2          15.9               186
126    Adelie Torgersen           40.6          19.0               199
127    Adelie Torgersen           38.8          17.6               191
128    Adelie Torgersen           41.5          18.3               195
129    Adelie Torgersen           39.0          17.1               191
130    Adelie Torgersen           44.1          18.0               210
131    Adelie Torgersen           38.5          17.9               190
132    Adelie Torgersen           43.1          19.2               197
133    Adelie     Dream           36.8          18.5               193
134    Adelie     Dream           37.5          18.5               199
135    Adelie     Dream           38.1          17.6               187
136    Adelie     Dream           41.1          17.5               190
137    Adelie     Dream           35.6          17.5               191
138    Adelie     Dream           40.2          20.1               200
139    Adelie     Dream           37.0          16.5               185
140    Adelie     Dream           39.7          17.9               193
141    Adelie     Dream           40.2          17.1               193
142    Adelie     Dream           40.6          17.2               187
143    Adelie     Dream           32.1          15.5               188
144    Adelie     Dream           40.7          17.0               190
145    Adelie     Dream           37.3          16.8               192
146    Adelie     Dream           39.0          18.7               185
147    Adelie     Dream           39.2          18.6               190
148    Adelie     Dream           36.6          18.4               184
149    Adelie     Dream           36.0          17.8               195
150    Adelie     Dream           37.8          18.1               193
151    Adelie     Dream           36.0          17.1               187
152    Adelie     Dream           41.5          18.5               201
153    Gentoo    Biscoe           46.1          13.2               211
154    Gentoo    Biscoe           50.0          16.3               230
155    Gentoo    Biscoe           48.7          14.1               210
156    Gentoo    Biscoe           50.0          15.2               218
157    Gentoo    Biscoe           47.6          14.5               215
158    Gentoo    Biscoe           46.5          13.5               210
159    Gentoo    Biscoe           45.4          14.6               211
160    Gentoo    Biscoe           46.7          15.3               219
161    Gentoo    Biscoe           43.3          13.4               209
162    Gentoo    Biscoe           46.8          15.4               215
163    Gentoo    Biscoe           40.9          13.7               214
164    Gentoo    Biscoe           49.0          16.1               216
165    Gentoo    Biscoe           45.5          13.7               214
166    Gentoo    Biscoe           48.4          14.6               213
167    Gentoo    Biscoe           45.8          14.6               210
168    Gentoo    Biscoe           49.3          15.7               217
169    Gentoo    Biscoe           42.0          13.5               210
170    Gentoo    Biscoe           49.2          15.2               221
171    Gentoo    Biscoe           46.2          14.5               209
172    Gentoo    Biscoe           48.7          15.1               222
173    Gentoo    Biscoe           50.2          14.3               218
174    Gentoo    Biscoe           45.1          14.5               215
175    Gentoo    Biscoe           46.5          14.5               213
176    Gentoo    Biscoe           46.3          15.8               215
177    Gentoo    Biscoe           42.9          13.1               215
178    Gentoo    Biscoe           46.1          15.1               215
179    Gentoo    Biscoe           44.5          14.3               216
180    Gentoo    Biscoe           47.8          15.0               215
181    Gentoo    Biscoe           48.2          14.3               210
182    Gentoo    Biscoe           50.0          15.3               220
183    Gentoo    Biscoe           47.3          15.3               222
184    Gentoo    Biscoe           42.8          14.2               209
185    Gentoo    Biscoe           45.1          14.5               207
186    Gentoo    Biscoe           59.6          17.0               230
187    Gentoo    Biscoe           49.1          14.8               220
188    Gentoo    Biscoe           48.4          16.3               220
189    Gentoo    Biscoe           42.6          13.7               213
190    Gentoo    Biscoe           44.4          17.3               219
191    Gentoo    Biscoe           44.0          13.6               208
192    Gentoo    Biscoe           48.7          15.7               208
193    Gentoo    Biscoe           42.7          13.7               208
194    Gentoo    Biscoe           49.6          16.0               225
195    Gentoo    Biscoe           45.3          13.7               210
196    Gentoo    Biscoe           49.6          15.0               216
197    Gentoo    Biscoe           50.5          15.9               222
198    Gentoo    Biscoe           43.6          13.9               217
199    Gentoo    Biscoe           45.5          13.9               210
200    Gentoo    Biscoe           50.5          15.9               225
201    Gentoo    Biscoe           44.9          13.3               213
202    Gentoo    Biscoe           45.2          15.8               215
203    Gentoo    Biscoe           46.6          14.2               210
204    Gentoo    Biscoe           48.5          14.1               220
205    Gentoo    Biscoe           45.1          14.4               210
206    Gentoo    Biscoe           50.1          15.0               225
207    Gentoo    Biscoe           46.5          14.4               217
208    Gentoo    Biscoe           45.0          15.4               220
209    Gentoo    Biscoe           43.8          13.9               208
210    Gentoo    Biscoe           45.5          15.0               220
211    Gentoo    Biscoe           43.2          14.5               208
212    Gentoo    Biscoe           50.4          15.3               224
213    Gentoo    Biscoe           45.3          13.8               208
214    Gentoo    Biscoe           46.2          14.9               221
215    Gentoo    Biscoe           45.7          13.9               214
216    Gentoo    Biscoe           54.3          15.7               231
217    Gentoo    Biscoe           45.8          14.2               219
218    Gentoo    Biscoe           49.8          16.8               230
219    Gentoo    Biscoe           46.2          14.4               214
220    Gentoo    Biscoe           49.5          16.2               229
221    Gentoo    Biscoe           43.5          14.2               220
222    Gentoo    Biscoe           50.7          15.0               223
223    Gentoo    Biscoe           47.7          15.0               216
224    Gentoo    Biscoe           46.4          15.6               221
225    Gentoo    Biscoe           48.2          15.6               221
226    Gentoo    Biscoe           46.5          14.8               217
227    Gentoo    Biscoe           46.4          15.0               216
228    Gentoo    Biscoe           48.6          16.0               230
229    Gentoo    Biscoe           47.5          14.2               209
230    Gentoo    Biscoe           51.1          16.3               220
231    Gentoo    Biscoe           45.2          13.8               215
232    Gentoo    Biscoe           45.2          16.4               223
233    Gentoo    Biscoe           49.1          14.5               212
234    Gentoo    Biscoe           52.5          15.6               221
235    Gentoo    Biscoe           47.4          14.6               212
236    Gentoo    Biscoe           50.0          15.9               224
237    Gentoo    Biscoe           44.9          13.8               212
238    Gentoo    Biscoe           50.8          17.3               228
239    Gentoo    Biscoe           43.4          14.4               218
240    Gentoo    Biscoe           51.3          14.2               218
241    Gentoo    Biscoe           47.5          14.0               212
242    Gentoo    Biscoe           52.1          17.0               230
243    Gentoo    Biscoe           47.5          15.0               218
244    Gentoo    Biscoe           52.2          17.1               228
245    Gentoo    Biscoe           45.5          14.5               212
246    Gentoo    Biscoe           49.5          16.1               224
247    Gentoo    Biscoe           44.5          14.7               214
248    Gentoo    Biscoe           50.8          15.7               226
249    Gentoo    Biscoe           49.4          15.8               216
250    Gentoo    Biscoe           46.9          14.6               222
251    Gentoo    Biscoe           48.4          14.4               203
252    Gentoo    Biscoe           51.1          16.5               225
253    Gentoo    Biscoe           48.5          15.0               219
254    Gentoo    Biscoe           55.9          17.0               228
255    Gentoo    Biscoe           47.2          15.5               215
256    Gentoo    Biscoe           49.1          15.0               228
257    Gentoo    Biscoe           47.3          13.8               216
258    Gentoo    Biscoe           46.8          16.1               215
259    Gentoo    Biscoe           41.7          14.7               210
260    Gentoo    Biscoe           53.4          15.8               219
261    Gentoo    Biscoe           43.3          14.0               208
262    Gentoo    Biscoe           48.1          15.1               209
263    Gentoo    Biscoe           50.5          15.2               216
264    Gentoo    Biscoe           49.8          15.9               229
265    Gentoo    Biscoe           43.5          15.2               213
266    Gentoo    Biscoe           51.5          16.3               230
267    Gentoo    Biscoe           46.2          14.1               217
268    Gentoo    Biscoe           55.1          16.0               230
269    Gentoo    Biscoe           44.5          15.7               217
270    Gentoo    Biscoe           48.8          16.2               222
271    Gentoo    Biscoe           47.2          13.7               214
272    Gentoo    Biscoe             NA            NA                NA
273    Gentoo    Biscoe           46.8          14.3               215
274    Gentoo    Biscoe           50.4          15.7               222
275    Gentoo    Biscoe           45.2          14.8               212
276    Gentoo    Biscoe           49.9          16.1               213
277 Chinstrap     Dream           46.5          17.9               192
278 Chinstrap     Dream           50.0          19.5               196
279 Chinstrap     Dream           51.3          19.2               193
280 Chinstrap     Dream           45.4          18.7               188
281 Chinstrap     Dream           52.7          19.8               197
282 Chinstrap     Dream           45.2          17.8               198
283 Chinstrap     Dream           46.1          18.2               178
284 Chinstrap     Dream           51.3          18.2               197
285 Chinstrap     Dream           46.0          18.9               195
286 Chinstrap     Dream           51.3          19.9               198
287 Chinstrap     Dream           46.6          17.8               193
288 Chinstrap     Dream           51.7          20.3               194
289 Chinstrap     Dream           47.0          17.3               185
290 Chinstrap     Dream           52.0          18.1               201
291 Chinstrap     Dream           45.9          17.1               190
292 Chinstrap     Dream           50.5          19.6               201
293 Chinstrap     Dream           50.3          20.0               197
294 Chinstrap     Dream           58.0          17.8               181
295 Chinstrap     Dream           46.4          18.6               190
296 Chinstrap     Dream           49.2          18.2               195
297 Chinstrap     Dream           42.4          17.3               181
298 Chinstrap     Dream           48.5          17.5               191
299 Chinstrap     Dream           43.2          16.6               187
300 Chinstrap     Dream           50.6          19.4               193
301 Chinstrap     Dream           46.7          17.9               195
302 Chinstrap     Dream           52.0          19.0               197
303 Chinstrap     Dream           50.5          18.4               200
304 Chinstrap     Dream           49.5          19.0               200
305 Chinstrap     Dream           46.4          17.8               191
306 Chinstrap     Dream           52.8          20.0               205
307 Chinstrap     Dream           40.9          16.6               187
308 Chinstrap     Dream           54.2          20.8               201
309 Chinstrap     Dream           42.5          16.7               187
310 Chinstrap     Dream           51.0          18.8               203
311 Chinstrap     Dream           49.7          18.6               195
312 Chinstrap     Dream           47.5          16.8               199
313 Chinstrap     Dream           47.6          18.3               195
314 Chinstrap     Dream           52.0          20.7               210
315 Chinstrap     Dream           46.9          16.6               192
316 Chinstrap     Dream           53.5          19.9               205
317 Chinstrap     Dream           49.0          19.5               210
318 Chinstrap     Dream           46.2          17.5               187
319 Chinstrap     Dream           50.9          19.1               196
320 Chinstrap     Dream           45.5          17.0               196
321 Chinstrap     Dream           50.9          17.9               196
322 Chinstrap     Dream           50.8          18.5               201
323 Chinstrap     Dream           50.1          17.9               190
324 Chinstrap     Dream           49.0          19.6               212
325 Chinstrap     Dream           51.5          18.7               187
326 Chinstrap     Dream           49.8          17.3               198
327 Chinstrap     Dream           48.1          16.4               199
328 Chinstrap     Dream           51.4          19.0               201
329 Chinstrap     Dream           45.7          17.3               193
330 Chinstrap     Dream           50.7          19.7               203
331 Chinstrap     Dream           42.5          17.3               187
332 Chinstrap     Dream           52.2          18.8               197
333 Chinstrap     Dream           45.2          16.6               191
334 Chinstrap     Dream           49.3          19.9               203
335 Chinstrap     Dream           50.2          18.8               202
336 Chinstrap     Dream           45.6          19.4               194
337 Chinstrap     Dream           51.9          19.5               206
338 Chinstrap     Dream           46.8          16.5               189
339 Chinstrap     Dream           45.7          17.0               195
340 Chinstrap     Dream           55.8          19.8               207
341 Chinstrap     Dream           43.5          18.1               202
342 Chinstrap     Dream           49.6          18.2               193
343 Chinstrap     Dream           50.8          19.0               210
344 Chinstrap     Dream           50.2          18.7               198
    body_mass_g    sex year
1          3750   male 2007
2          3800 female 2007
3          3250 female 2007
4            NA   <NA> 2007
5          3450 female 2007
6          3650   male 2007
7          3625 female 2007
8          4675   male 2007
9          3475   <NA> 2007
10         4250   <NA> 2007
11         3300   <NA> 2007
12         3700   <NA> 2007
13         3200 female 2007
14         3800   male 2007
15         4400   male 2007
16         3700 female 2007
17         3450 female 2007
18         4500   male 2007
19         3325 female 2007
20         4200   male 2007
21         3400 female 2007
22         3600   male 2007
23         3800 female 2007
24         3950   male 2007
25         3800   male 2007
26         3800 female 2007
27         3550   male 2007
28         3200 female 2007
29         3150 female 2007
30         3950   male 2007
31         3250 female 2007
32         3900   male 2007
33         3300 female 2007
34         3900   male 2007
35         3325 female 2007
36         4150   male 2007
37         3950   male 2007
38         3550 female 2007
39         3300 female 2007
40         4650   male 2007
41         3150 female 2007
42         3900   male 2007
43         3100 female 2007
44         4400   male 2007
45         3000 female 2007
46         4600   male 2007
47         3425   male 2007
48         2975   <NA> 2007
49         3450 female 2007
50         4150   male 2007
51         3500 female 2008
52         4300   male 2008
53         3450 female 2008
54         4050   male 2008
55         2900 female 2008
56         3700   male 2008
57         3550 female 2008
58         3800   male 2008
59         2850 female 2008
60         3750   male 2008
61         3150 female 2008
62         4400   male 2008
63         3600 female 2008
64         4050   male 2008
65         2850 female 2008
66         3950   male 2008
67         3350 female 2008
68         4100   male 2008
69         3050 female 2008
70         4450   male 2008
71         3600 female 2008
72         3900   male 2008
73         3550 female 2008
74         4150   male 2008
75         3700 female 2008
76         4250   male 2008
77         3700 female 2008
78         3900   male 2008
79         3550 female 2008
80         4000   male 2008
81         3200 female 2008
82         4700   male 2008
83         3800 female 2008
84         4200   male 2008
85         3350 female 2008
86         3550   male 2008
87         3800   male 2008
88         3500 female 2008
89         3950   male 2008
90         3600 female 2008
91         3550 female 2008
92         4300   male 2008
93         3400 female 2008
94         4450   male 2008
95         3300 female 2008
96         4300   male 2008
97         3700 female 2008
98         4350   male 2008
99         2900 female 2008
100        4100   male 2008
101        3725 female 2009
102        4725   male 2009
103        3075 female 2009
104        4250   male 2009
105        2925 female 2009
106        3550   male 2009
107        3750 female 2009
108        3900   male 2009
109        3175 female 2009
110        4775   male 2009
111        3825 female 2009
112        4600   male 2009
113        3200 female 2009
114        4275   male 2009
115        3900 female 2009
116        4075   male 2009
117        2900 female 2009
118        3775   male 2009
119        3350 female 2009
120        3325   male 2009
121        3150 female 2009
122        3500   male 2009
123        3450 female 2009
124        3875   male 2009
125        3050 female 2009
126        4000   male 2009
127        3275 female 2009
128        4300   male 2009
129        3050 female 2009
130        4000   male 2009
131        3325 female 2009
132        3500   male 2009
133        3500 female 2009
134        4475   male 2009
135        3425 female 2009
136        3900   male 2009
137        3175 female 2009
138        3975   male 2009
139        3400 female 2009
140        4250   male 2009
141        3400 female 2009
142        3475   male 2009
143        3050 female 2009
144        3725   male 2009
145        3000 female 2009
146        3650   male 2009
147        4250   male 2009
148        3475 female 2009
149        3450 female 2009
150        3750   male 2009
151        3700 female 2009
152        4000   male 2009
153        4500 female 2007
154        5700   male 2007
155        4450 female 2007
156        5700   male 2007
157        5400   male 2007
158        4550 female 2007
159        4800 female 2007
160        5200   male 2007
161        4400 female 2007
162        5150   male 2007
163        4650 female 2007
164        5550   male 2007
165        4650 female 2007
166        5850   male 2007
167        4200 female 2007
168        5850   male 2007
169        4150 female 2007
170        6300   male 2007
171        4800 female 2007
172        5350   male 2007
173        5700   male 2007
174        5000 female 2007
175        4400 female 2007
176        5050   male 2007
177        5000 female 2007
178        5100   male 2007
179        4100   <NA> 2007
180        5650   male 2007
181        4600 female 2007
182        5550   male 2007
183        5250   male 2007
184        4700 female 2007
185        5050 female 2007
186        6050   male 2007
187        5150 female 2008
188        5400   male 2008
189        4950 female 2008
190        5250   male 2008
191        4350 female 2008
192        5350   male 2008
193        3950 female 2008
194        5700   male 2008
195        4300 female 2008
196        4750   male 2008
197        5550   male 2008
198        4900 female 2008
199        4200 female 2008
200        5400   male 2008
201        5100 female 2008
202        5300   male 2008
203        4850 female 2008
204        5300   male 2008
205        4400 female 2008
206        5000   male 2008
207        4900 female 2008
208        5050   male 2008
209        4300 female 2008
210        5000   male 2008
211        4450 female 2008
212        5550   male 2008
213        4200 female 2008
214        5300   male 2008
215        4400 female 2008
216        5650   male 2008
217        4700 female 2008
218        5700   male 2008
219        4650   <NA> 2008
220        5800   male 2008
221        4700 female 2008
222        5550   male 2008
223        4750 female 2008
224        5000   male 2008
225        5100   male 2008
226        5200 female 2008
227        4700 female 2008
228        5800   male 2008
229        4600 female 2008
230        6000   male 2008
231        4750 female 2008
232        5950   male 2008
233        4625 female 2009
234        5450   male 2009
235        4725 female 2009
236        5350   male 2009
237        4750 female 2009
238        5600   male 2009
239        4600 female 2009
240        5300   male 2009
241        4875 female 2009
242        5550   male 2009
243        4950 female 2009
244        5400   male 2009
245        4750 female 2009
246        5650   male 2009
247        4850 female 2009
248        5200   male 2009
249        4925   male 2009
250        4875 female 2009
251        4625 female 2009
252        5250   male 2009
253        4850 female 2009
254        5600   male 2009
255        4975 female 2009
256        5500   male 2009
257        4725   <NA> 2009
258        5500   male 2009
259        4700 female 2009
260        5500   male 2009
261        4575 female 2009
262        5500   male 2009
263        5000 female 2009
264        5950   male 2009
265        4650 female 2009
266        5500   male 2009
267        4375 female 2009
268        5850   male 2009
269        4875   <NA> 2009
270        6000   male 2009
271        4925 female 2009
272          NA   <NA> 2009
273        4850 female 2009
274        5750   male 2009
275        5200 female 2009
276        5400   male 2009
277        3500 female 2007
278        3900   male 2007
279        3650   male 2007
280        3525 female 2007
281        3725   male 2007
282        3950 female 2007
283        3250 female 2007
284        3750   male 2007
285        4150 female 2007
286        3700   male 2007
287        3800 female 2007
288        3775   male 2007
289        3700 female 2007
290        4050   male 2007
291        3575 female 2007
292        4050   male 2007
293        3300   male 2007
294        3700 female 2007
295        3450 female 2007
296        4400   male 2007
297        3600 female 2007
298        3400   male 2007
299        2900 female 2007
300        3800   male 2007
301        3300 female 2007
302        4150   male 2007
303        3400 female 2008
304        3800   male 2008
305        3700 female 2008
306        4550   male 2008
307        3200 female 2008
308        4300   male 2008
309        3350 female 2008
310        4100   male 2008
311        3600   male 2008
312        3900 female 2008
313        3850 female 2008
314        4800   male 2008
315        2700 female 2008
316        4500   male 2008
317        3950   male 2008
318        3650 female 2008
319        3550   male 2008
320        3500 female 2008
321        3675 female 2009
322        4450   male 2009
323        3400 female 2009
324        4300   male 2009
325        3250   male 2009
326        3675 female 2009
327        3325 female 2009
328        3950   male 2009
329        3600 female 2009
330        4050   male 2009
331        3350 female 2009
332        3450   male 2009
333        3250 female 2009
334        4050   male 2009
335        3800   male 2009
336        3525 female 2009
337        3950   male 2009
338        3650 female 2009
339        3650 female 2009
340        4000   male 2009
341        3400 female 2009
342        3775   male 2009
343        4100   male 2009
344        3775 female 2009

Notice, that “penguin_data_from_base_r” now appears in the Environment tab of the upper right pane of the RStudio window. We can use this tab to quickly check the data we’ve loaded into R.

While the read.csv() function gets the job done, it doesn’t do a lot to format the data.

2.2 The readr package

dplyr logo

As the name implies, the readr package contains functions designed to help us read tabular data from text files. These functions a lot of useful features to mark and handle problems we’re likely to encounter in real-world data (we’ll see some examples of this in later lessons).

To read this csv file, we’re going to use the read_csv() function from the readr package. The command is almost identical to base R, except we have an underscore in read_csv(), instead of a period.

penguin_data_from_readr <- read_csv(file = here::here("DATA/penguins.csv"))
Rows: 344 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): species, island, sex
dbl (5): bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, year

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

The readr function includes some additional output that tells us about guesses it’s making about data in the file. Specifically, it determined the species, island, and sex columns contain text (or “characters”), while the remaining columns contain numbers with decimal points (or “doubles”).

Compare the the contents of the penguin_data_from_readr to the penguin_data_from_base_r data we loaded above:

penguin_data_from_readr
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <chr>   <chr>              <dbl>         <dbl>             <dbl>       <dbl>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <chr>, year <dbl>

readr functions read data from files and store them as tibbles, a special version of a data frame. We saw an example of a tibble when we worked with the penguins data frame, loaded by the palmerpenguins package.

3 Transforming penguins

3.1 The dplyr Package

dplyr logo

The dplyr package comes with many functions for manipulating and extracting information from tabular data. As we’ll see below, dplyr functions are named after verbs that describe what we’re doing to the input data, and the first argument of every function is the input data frame (or tibble).

For simplicity, we’ll return to using the penguins data frame for the remainder of this lesson. While we could use dplyr functions to work with the data we read from penguins.csv, the penguins data frame has some nicer formatting.

3.2 Filter

We can use the filter() function to grab rows from our data that contain specific information. Here, we extract just those rows containing measurements from Gentoo penguins.

filter(penguins, species == "Gentoo")
# A tibble: 124 × 8
   species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
 1 Gentoo  Biscoe           46.1          13.2               211        4500
 2 Gentoo  Biscoe           50            16.3               230        5700
 3 Gentoo  Biscoe           48.7          14.1               210        4450
 4 Gentoo  Biscoe           50            15.2               218        5700
 5 Gentoo  Biscoe           47.6          14.5               215        5400
 6 Gentoo  Biscoe           46.5          13.5               210        4550
 7 Gentoo  Biscoe           45.4          14.6               211        4800
 8 Gentoo  Biscoe           46.7          15.3               219        5200
 9 Gentoo  Biscoe           43.3          13.4               209        4400
10 Gentoo  Biscoe           46.8          15.4               215        5150
# ℹ 114 more rows
# ℹ 2 more variables: sex <fct>, year <int>

Remember, the species information in our data frame is contained in the “species” column. This example code is telling R to search through every row in the penguin data, and return those rows that have “Gentoo” in the species column. From this example, we can see the general form for using the filter function: filter(dataset, comparison). We used “==” to indicate we want to find all rows in species that match the word “Gentoo”. This is an example of a relational operator.

Compare Values With Relational Operators

R supports several operators that let us compare values:

  • == : Check if two values are exactly equal. Many programming languages use the double equals sign to indicate comparisons, because they’re already using the single equal sign for something else (e.g. assignment).
  • <, > : Less-than, and greater-than comparisons.
  • <=, >= : Less-than or equal, and greater-than or equal comparisons.
  • != : Check if two values are not equal.

These operators return a logical value: TRUE or FALSE.

"Gentoo" == "Gentoo"
[1] TRUE
121 < 43
[1] FALSE

We can also combine multiple filtering conditions in the same command. In this example, we want to get the rows containing data from female Adelie penguins.

filter(penguins,
       sex == "female",
       species == "Adelie")
# A tibble: 73 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.5          17.4               186        3800
 2 Adelie  Torgersen           40.3          18                 195        3250
 3 Adelie  Torgersen           36.7          19.3               193        3450
 4 Adelie  Torgersen           38.9          17.8               181        3625
 5 Adelie  Torgersen           41.1          17.6               182        3200
 6 Adelie  Torgersen           36.6          17.8               185        3700
 7 Adelie  Torgersen           38.7          19                 195        3450
 8 Adelie  Torgersen           34.4          18.4               184        3325
 9 Adelie  Biscoe              37.8          18.3               174        3400
10 Adelie  Biscoe              35.9          19.2               189        3800
# ℹ 63 more rows
# ℹ 2 more variables: sex <fct>, year <int>

Alternatively, we can combine multiple conditions with the & symbol (meaning “and”) and the | symbol (meaning “or”). We can re-write the previous filter command using the & symbol:

filter(penguins,
       sex == "female" & species == "Adelie")
# A tibble: 73 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.5          17.4               186        3800
 2 Adelie  Torgersen           40.3          18                 195        3250
 3 Adelie  Torgersen           36.7          19.3               193        3450
 4 Adelie  Torgersen           38.9          17.8               181        3625
 5 Adelie  Torgersen           41.1          17.6               182        3200
 6 Adelie  Torgersen           36.6          17.8               185        3700
 7 Adelie  Torgersen           38.7          19                 195        3450
 8 Adelie  Torgersen           34.4          18.4               184        3325
 9 Adelie  Biscoe              37.8          18.3               174        3400
10 Adelie  Biscoe              35.9          19.2               189        3800
# ℹ 63 more rows
# ℹ 2 more variables: sex <fct>, year <int>

We can use the | symbol to retrieve data from female penguins that are either Adelie or Chinstrap:

filter(penguins,
       sex == "female",
       species == "Adelie" | species == "Chinstrap")
# A tibble: 107 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.5          17.4               186        3800
 2 Adelie  Torgersen           40.3          18                 195        3250
 3 Adelie  Torgersen           36.7          19.3               193        3450
 4 Adelie  Torgersen           38.9          17.8               181        3625
 5 Adelie  Torgersen           41.1          17.6               182        3200
 6 Adelie  Torgersen           36.6          17.8               185        3700
 7 Adelie  Torgersen           38.7          19                 195        3450
 8 Adelie  Torgersen           34.4          18.4               184        3325
 9 Adelie  Biscoe              37.8          18.3               174        3400
10 Adelie  Biscoe              35.9          19.2               189        3800
# ℹ 97 more rows
# ℹ 2 more variables: sex <fct>, year <int>

Let’s use the filter function to create a new data frame that just contains rows from our penguin data from Gentoo penguins.

gentoo_penguin_data <- filter(penguins, )

3.3 Select

With the filter() function, we can choose which rows we want to extract from our data. If we want to choose which columns to extract, we use the select() function. Here we extract the columns containing the species, flipper length, and body mass from each penguin in the dataset.

select(penguins, species, flipper_length_mm, body_mass_g)
# A tibble: 344 × 3
   species flipper_length_mm body_mass_g
   <fct>               <int>       <int>
 1 Adelie                181        3750
 2 Adelie                186        3800
 3 Adelie                195        3250
 4 Adelie                 NA          NA
 5 Adelie                193        3450
 6 Adelie                190        3650
 7 Adelie                181        3625
 8 Adelie                195        4675
 9 Adelie                193        3475
10 Adelie                190        4250
# ℹ 334 more rows

From this example code, we can see the general form of the select() function: select(dataset, column_name1, column_name2, ...). The select() function is very useful for reducing our dataset to just the columns we need for a particular calculation or analysis. This is critical when we’re working with input data that have 100s of columns.

Above, we created a data frame that only contains data from Gentoo penguins. Now let’s use the select() function on that data frame to extract the columns containing the species, flipper length, and body mass measurements. We’ll save the selected data frame in a new variable.

gentoo_body_and_flipper <- select(gentoo_penguin_data, species, )

3.4 Mutate

Cartoon representation of mutate operation

Artwork by @allison_horst

If we want to add new columns to a data frame, We use the mutate() function. Here, we add a new column which contains the body mass of each penguins in kilograms (the “body_mass_g” column is in grams).

mutate(penguins,
       body_mass_kg = body_mass_g / 1000)
# A tibble: 344 × 9
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 3 more variables: sex <fct>, year <int>, body_mass_kg <dbl>

From this code, we can see the general form of the mutate() function: mutate(dataset, new_column_name = expression). In this example, used the “/” operator to indicate we want to divide penguin body mass in grams by 1000, to calculate the body mass in kilograms. This is an example of an arithmetic operator.

Arithmetic Operators

R supports several operators that allow us to perform various mathematical operations:

  • + addition
  • - subtraction
  • * multiplication
  • / division
  • ^ exponentiation

When we use these operators on the column of a data frame, they’re designed to work separately on each value in the column (called an “element wise” operation).

Note that mutate() adds new columns to the right side of the data frame. If we want to add new columns in different locations, we can use the .before and .after arguments.

mutate(penguins,
       body_mass_kg = body_mass_g / 1000,
       .after = body_mass_g)
# A tibble: 344 × 9
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 3 more variables: body_mass_kg <dbl>, sex <fct>, year <int>

We can provide .after and .before with either a column name (like we did above), or a number referring to the position in the table we want to insert the new column. Here we insert the new column before the current second column:

mutate(penguins,
       body_mass_kg = body_mass_g / 1000,
       .before = 2)
# A tibble: 344 × 9
   species body_mass_kg island    bill_length_mm bill_depth_mm flipper_length_mm
   <fct>          <dbl> <fct>              <dbl>         <dbl>             <int>
 1 Adelie          3.75 Torgersen           39.1          18.7               181
 2 Adelie          3.8  Torgersen           39.5          17.4               186
 3 Adelie          3.25 Torgersen           40.3          18                 195
 4 Adelie         NA    Torgersen           NA            NA                  NA
 5 Adelie          3.45 Torgersen           36.7          19.3               193
 6 Adelie          3.65 Torgersen           39.3          20.6               190
 7 Adelie          3.62 Torgersen           38.9          17.8               181
 8 Adelie          4.68 Torgersen           39.2          19.6               195
 9 Adelie          3.48 Torgersen           34.1          18.1               193
10 Adelie          4.25 Torgersen           42            20.2               190
# ℹ 334 more rows
# ℹ 3 more variables: body_mass_g <int>, sex <fct>, year <int>

Let’s use the mutate() function to add a body mass (mg) column to the data frame of Gentoo data we’ve been working on so far. We’ll save this expanded data frame in a new variable.

gentoo_body_kg_and_flipper <- mutate(gentoo_body_and_flipper,)

3.5 Pipes in R

R has a functionality allowing us to take the output of one function and provide it as input to another. The general name for this type of operation is “piping”. The pipe operator in R is |>. Here we use the filter() function, the R pipe (|>), and the head() function to view the first six rows returned by the filter function.

filter(penguins, species == "Gentoo") |> head()
# A tibble: 6 × 8
  species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
1 Gentoo  Biscoe           46.1          13.2               211        4500
2 Gentoo  Biscoe           50            16.3               230        5700
3 Gentoo  Biscoe           48.7          14.1               210        4450
4 Gentoo  Biscoe           50            15.2               218        5700
5 Gentoo  Biscoe           47.6          14.5               215        5400
6 Gentoo  Biscoe           46.5          13.5               210        4550
# ℹ 2 more variables: sex <fct>, year <int>

In this R code, we use the filter() command to extract all of the rows from the data containing measurements from Gentoo penguins. We then use the |> operator to send the output of the filter() function to the head() function. The head() function returns the first 6 rows from its input data frame.

RStudio’s pipe shortcut

When working in RStudio, we can use the shortcut Ctrl/Cmd + Shift + M to enter the pipe.

Over the last three sections we used the filter(), select(), and mutate functions to create this data frame:

head(gentoo_body_kg_and_flipper)
# A tibble: 6 × 1
  species
  <fct>  
1 Adelie 
2 Adelie 
3 Adelie 
4 Adelie 
5 Adelie 
6 Adelie 

We saved each of the intermediates to their own variables (take a look in the Environment tab to see the list of variables). Alternatively, we can use the pipe to generate the same data frame without saving any of the intermediate results:

penguins |> 
    filter(species == "Gentoo") |> 
    select(species, flipper_length_mm, body_mass_g) |> 
    mutate(body_mass_mg = body_mass_g * 1000,
           .before = body_mass_g) |> 
    head()
# A tibble: 6 × 4
  species flipper_length_mm body_mass_mg body_mass_g
  <fct>               <int>        <dbl>       <int>
1 Gentoo                211      4500000        4500
2 Gentoo                230      5700000        5700
3 Gentoo                210      4450000        4450
4 Gentoo                218      5700000        5700
5 Gentoo                215      5400000        5400
6 Gentoo                210      4550000        4550

With the pipe operator, we can combine many simple R functions to create complex pipelines, all while keeping our code readable.

Under the hood, the |> operator is taking the output of the function on its left and feeding it into the first argument of the function on its right. All dplyr functions are fully compatible with the pipe operator (the first argument of every function is the input data frame).

Another Pipe Operator: %>%

The |> pipe operator is a relatively recent (May 2021) addition to base R. Before that, we needed to use the %>% operator, also called the “magrittr pipe.” This operator is still around and used in a lot of existing R code, but it requires us to load the magrittr package. In these lessons, we’ll only use the native |> pipe operator, so we don’t need to load any extra packages.

3.6 Arrange

We can use the arrange() function to sort the rows in our data according to the values in one or more columns. Here we sort the penguins by bill depth:

arrange(penguins, bill_depth_mm)
# A tibble: 344 × 8
   species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
 1 Gentoo  Biscoe           42.9          13.1               215        5000
 2 Gentoo  Biscoe           46.1          13.2               211        4500
 3 Gentoo  Biscoe           44.9          13.3               213        5100
 4 Gentoo  Biscoe           43.3          13.4               209        4400
 5 Gentoo  Biscoe           46.5          13.5               210        4550
 6 Gentoo  Biscoe           42            13.5               210        4150
 7 Gentoo  Biscoe           44            13.6               208        4350
 8 Gentoo  Biscoe           40.9          13.7               214        4650
 9 Gentoo  Biscoe           45.5          13.7               214        4650
10 Gentoo  Biscoe           42.6          13.7               213        4950
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>

By default, arrange() sorts values from smallest to largest (ascending). We can use the desc() function inside arrange to sort values from largest to smallest (descending).

arrange(penguins, desc(bill_depth_mm))
# A tibble: 344 × 8
   species   island   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>     <fct>             <dbl>         <dbl>             <int>       <int>
 1 Adelie    Torgers…           46            21.5               194        4200
 2 Adelie    Torgers…           38.6          21.2               191        3800
 3 Adelie    Dream              42.3          21.2               191        4150
 4 Adelie    Torgers…           34.6          21.1               198        4400
 5 Adelie    Dream              39.2          21.1               196        4150
 6 Adelie    Biscoe             41.3          21.1               195        4400
 7 Chinstrap Dream              54.2          20.8               201        4300
 8 Adelie    Torgers…           42.5          20.7               197        4500
 9 Adelie    Biscoe             39.6          20.7               191        3900
10 Chinstrap Dream              52            20.7               210        4800
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>

Lastly, we can sort data based on multiple columns:

arrange(penguins, island, desc(bill_depth_mm))
# A tibble: 344 × 8
   species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
 1 Adelie  Biscoe           41.3          21.1               195        4400
 2 Adelie  Biscoe           39.6          20.7               191        3900
 3 Adelie  Biscoe           45.6          20.3               191        4600
 4 Adelie  Biscoe           41            20                 203        4725
 5 Adelie  Biscoe           37.8          20                 190        4250
 6 Adelie  Biscoe           38.2          20                 190        3900
 7 Adelie  Biscoe           42            19.5               200        4050
 8 Adelie  Biscoe           42.2          19.5               197        4275
 9 Adelie  Biscoe           35.9          19.2               189        3800
10 Adelie  Biscoe           37.6          19.1               194        3750
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>

3.7 Distinct

The distinct() function returns all unique rows from the data frame. We can provide the names of the columns we want to search for unique combinations. Here we want to find all unique combinations of species and island.

distinct(penguins, species, island)
# A tibble: 5 × 2
  species   island   
  <fct>     <fct>    
1 Adelie    Torgersen
2 Adelie    Biscoe   
3 Adelie    Dream    
4 Gentoo    Biscoe   
5 Chinstrap Dream    

If we don’t specify any column names, the distinct() function will look for unique combinations across all columns.

distinct(penguins)
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>

3.8 Grouping and Summarizing Data

We started transforming these data because we wanted to extract some summary stats about the body mass and flipper length of the three penguin species in our data. With the functions we’ve used so far, we can extract the data we need for a specific penguin species. Here, we use the summarize() function to calculate the mean mean body mass and flipper length across all Gentoo penguins in our data.

penguins |> 
    filter(species == "Gentoo") |> 
    select(species, flipper_length_mm, body_mass_g) |> 
    # na.omit() |> # Filter out any rows containing NA values in any columns
    summarize(mean_body_mass_g = mean(body_mass_g),
              mean_flipper_length_mm = mean(flipper_length_mm))
# A tibble: 1 × 2
  mean_body_mass_g mean_flipper_length_mm
             <dbl>                  <dbl>
1               NA                     NA

In this example we use dplyr functions and the pipe operator to filter our data for Gentoo penguins, select our columns of interest (body_mass_g and flipper_length_mm), and the summarize() function (also from dplyr) to calculate the mean values across data in the body_mass_g and flipper_length_mm columns.

Missing Data

The first time we ran the code above, we didn’t use the na.omit() function and our mean body mass and flipper length calculations returned ‘NA’ values. ‘NA’ is one of the ways R represents missing data, and it turns out one of the penguins in our dataset has ‘NA’ values for all of its measurements (you can find it by looking through the data with the View() function). Many function that perform mathematical operations (like mean), will return an ‘NA’ value if any of its inputs are ‘NA’. This is so we’re aware there are ‘NA’ values present in our data and can handle them accordingly. Once we realized there was a single ‘NA’ values in our data, we excluded it using the na.omit() function.

From this code, we see the general form of the summarize() function: summarize(dataset, column_name = expression). This is quite similar to the mutate() function. However, while the mutate() function performs a calculation for each row in a data frame column, the summarize() function performs one calculation using all of the data in a column.

Using the summarize() function we can quickly calculate summary stats from the columns in a data frame. In order to get the same summary stats for the other penguin data, we’d need to repeat the same set of operations two more times (one for each species). Ideally, we want to be able to work on data from all three penguin species at the same time.

We can do this with the group_by() function.

penguins |> 
    select(species, body_mass_g, flipper_length_mm) |> 
    group_by(species) |> 
    # This time we're using arguements in the mean function to remove the NA values
    summarize(mean_body_mass_g = mean(body_mass_g, na.rm = TRUE),
              mean_flipper_length_mm = mean(flipper_length_mm, na.rm = TRUE))
# A tibble: 3 × 3
  species   mean_body_mass_g mean_flipper_length_mm
  <fct>                <dbl>                  <dbl>
1 Adelie               3701.                   190.
2 Chinstrap            3733.                   196.
3 Gentoo               5076.                   217.

Here we use the group_by() function to group the data according to the species column, before using the summarize() function. This grouping causes the summarize() function to perform calculations across the data within each group (species, in this case), rather than across the entire data frame.

The means don’t give us the whole picture, so let’s calculate the standard deviations for each of these measurements, as well as the total number of penguins from each species.

penguins |> 
    select(species, body_mass_g, flipper_length_mm) |> 
    group_by(species) |> 
    na.omit() |>
    summarize(mean_body_mass_g = mean(body_mass_g),
              sd_body_mass_g = sd(body_mass_g),
              mean_flipper_length_mm = mean(flipper_length_mm),
              sd_mean_flipper_length_mm = sd(flipper_length_mm),
              Total_animals = n())

It looks like the raw numbers agree with what we saw in the figure we generated in the previous lesson. Namely, the Gentoo penguins tend to have more mass and longer flippers than the other two species. And while the Chinstrap penguins have higher mean body mass and flipper length than the Adélie penguins, the standard deviations in these measurements are large enough that there probably isn’t a significant difference in size between the two. In the coming lessons, we’ll apply some statistical tests to these data to test our hypotheses.

Not all operations are equivalent

So far, we’ve seen two ways of keeping ‘NA’ values from affecting our calculations: 1. The na.omit() function removes all rows from a data frame that contain ‘NA’ values in any column. 2. The mean() and sd() have an na.rm argument that excludes all ‘NA’ values from the mean / standard deviation calculations when we set it to TRUE (na.rm = TRUE).

We used the na.omit() function to exclude the ‘NA’ values before we used the summarize() function to calculated all of our summary statistics above. However, if we skip the na.omit() function and instead use the “na.rm” argument for mean() and sd() to exclude the ‘NA’ values, we get a slightly different result.

penguins |> 
    select(species, body_mass_g, flipper_length_mm) |> 
    group_by(species) |> 
    # na.omit() |> 
    summarize(mean_body_mass_g = mean(body_mass_g, na.rm = TRUE),
              sd_body_mass_g = sd(body_mass_g, na.rm = TRUE),
              mean_flipper_length_mm = mean(flipper_length_mm, na.rm = TRUE),
              sd_mean_flipper_length_mm = sd(flipper_length_mm, na.rm = TRUE),
              Total_animals = n())
# A tibble: 3 × 6
  species   mean_body_mass_g sd_body_mass_g mean_flipper_length_mm
  <fct>                <dbl>          <dbl>                  <dbl>
1 Adelie               3701.           459.                   190.
2 Chinstrap            3733.           384.                   196.
3 Gentoo               5076.           504.                   217.
# ℹ 2 more variables: sd_mean_flipper_length_mm <dbl>, Total_animals <int>

Compare these results to the previous code using na.omit(), paying close attention to the “Total_animals” column. If there are too many columns in the results to compare them easily, you could always use the select() function to grab just the “species” and “Total_animals” columns.

When we used the na.rm argument approach, we ended with with one extra penguin in the “Total_animals” column for the Adélie and Gentoo penguins. This is because the n() function counts rows, regardless of their contents (you can confirm there’s no na.rm argument for n() using the R docs). While there are many different ways to accomplish the same task, they are not all equivalent in all cases. If we weren’t also using the n() function to count the total number of penguins in each species, both of our methods for removing the ‘NA’ values would have produced the same result.

Even though this is a toy example, we’ve created a flexible analysis pipeline by combining these dplyr functions, The code we’ve written will still work if we collect new data from different penguin species, add additional biological measurements beyond body mass and flipper length, or remove some of the rows from the original input data.

4 R session information

Here we report the version number for R and the package versions we used to perform the analyses in this document.

sessionInfo()
R version 4.3.1 (2023-06-16 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_1.1.2          readr_2.1.4          palmerpenguins_0.1.1
[4] here_1.0.1          

loaded via a namespace (and not attached):
 [1] crayon_1.5.2      vctrs_0.6.3       cli_3.6.1         knitr_1.43       
 [5] rlang_1.1.1       xfun_0.39         generics_0.1.3    jsonlite_1.8.7   
 [9] bit_4.0.5         glue_1.6.2        rprojroot_2.0.3   htmltools_0.5.5  
[13] hms_1.1.3         fansi_1.0.4       rmarkdown_2.23    evaluate_0.21    
[17] tibble_3.2.1      tzdb_0.4.0        fastmap_1.1.1     yaml_2.3.7       
[21] lifecycle_1.0.3   compiler_4.3.1    htmlwidgets_1.6.2 pkgconfig_2.0.3  
[25] rstudioapi_0.15.0 digest_0.6.33     R6_2.5.1          tidyselect_1.2.0 
[29] utf8_1.2.3        parallel_4.3.1    vroom_1.6.3       pillar_1.9.0     
[33] magrittr_2.0.3    withr_2.5.0       bit64_4.0.5       tools_4.3.1      
Back to top